Abstract We present Searn, an algorithm for integrating SEARch and
lEARNing to solve complex structured prediction problems such as those that 
occur in natural language, speech, computational biology, and vision. Searn 
is a meta-algorithm that transforms these complex problems into simple 
classification problems to which any binary classifier may be applied. Unlike 
current algorithms for structured learning that require decomposition of 
both the loss function and the feature functions over the predicted structure, 
Searn is able to learn prediction functions for any loss function and any 
class of features. Moreover, Searn comes with a strong, natural theoretical 
guarantee: good performance on the derived classification problems implies 
good performance on the structured prediction problem. 

1 Introduction 

Prediction is the task of learning a function $f$ that maps inputs $x$ in an input
domain $\mathcal{X}$ to outputs $y$ in an output domain $\mathcal{Y}$. Standard algorithms
(support vector machines, decision trees, neural networks, etc.) focus on
"simple" output domains such as $\mathcal{Y} = \{-1, +1\}$ (in the case of binary
classification) or $\mathcal{Y} = \mathbb{R}$ (in the case of univariate regression).

We are interested in problems for which elements $y \in \mathcal{Y}$ have complex
internal structure. The simplest and best studied such output domain is 
that of labeled sequences. However, we are interested in even more complex 
domains, such as the space of English sentences (for instance in a machine 
translation application), the space of short documents (perhaps in an
automatic document summarization application), or the space of possible
assignments of elements in a database (in an information extraction/data mining
application). The structured complexity of features and loss functions in 
these problems significantly exceeds that of sequence labeling problems. 
From a high level, there are four dimensions along which structured prediction
algorithms vary: structure (varieties of structure for which efficient
learning is possible), loss (different loss functions for which learning is
possible), features (generality of feature functions for which learning is possible)
and data (ability of algorithm to cope with imperfect data sources such as 
missing data, etc.). An in-depth discussion of alternative structured predic- 
tion algorithms is given in Section 5. However, to give a flavor, the popular 
conditional random field algorithm [29] is viewed along these dimensions as 
follows. Structure: inference for a CRF is tractable for any graphical model 
with bounded tree width; Loss: the CRF typically optimizes a log-loss ap- 
proximation to 0/1 loss over the entire structure; Features: any feature of 
the input is possible but only output features that obey the graphical model 
structure are allowed; Data: EM can cope with hidden variables. 

We prefer a structured prediction algorithm that is not limited to models 
with bounded treewidth, is applicable to any loss function, can handle ar- 
bitrary features and can cope with imperfect data. Somewhat surprisingly, 
Searn meets nearly all of these requirements by transforming structured 
prediction problems into binary prediction problems to which a vanilla bi- 
nary classifier can be applied. Searn comes with a strong theoretical guar- 
antee: good binary classification performance implies good structured pre- 
diction performance. Simple applications of Searn to standard structured 
prediction problems yield tractable state-of-the-art performance. Moreover, 
we can apply Searn to more complex, non-standard structured prediction 
problems and achieve excellent empirical performance. 

This paper has the following outline: 

1. Introduction. 

2. Core Definitions. 

3. The Searn Algorithm. 

4. Theoretical Analysis. 

5. A Comparison to Alternative Techniques. 

6. Experimental results. 

7. Discussion. 



2 Core Definitions 

In order to proceed, it is useful to formally define a structured prediction 
problem in terms of a state space. 

Definition 1 A structured prediction problem $\mathcal{D}$ is a cost-sensitive
classification problem where $\mathcal{Y}$ has structure: elements $y \in \mathcal{Y}$ decompose into
variable-length vectors $(y_1, y_2, \ldots, y_T)$ (see footnote 1). $\mathcal{D}$ is a distribution over inputs
$x \in \mathcal{X}$ and cost vectors $c$, where $|c|$ is a variable in $2^T$.

1 Treating y as a vector is simply a useful encoding; we are not interested only 
in sequence labeling problems. 
As a simple example, consider a parsing problem under $F_1$ loss. In this
case, $\mathcal{D}$ is a distribution over $(x, c)$ where $x$ is an input sequence and for all
trees $y$ with $|x|$-many leaves, $c_y$ is the $F_1$ loss of $y$ on the "true" output.

The goal of structured prediction is to find a function $h : \mathcal{X} \to \mathcal{Y}$ that
minimizes the loss given in Eq (1).

$L(\mathcal{D}, h) = E_{(x,c) \sim \mathcal{D}} \{ c_{h(x)} \}$ (1)

The algorithm we present is based on the view that a vector $y \in \mathcal{Y}$ can
be produced by predicting each component $(y_1, \ldots, y_T)$ in turn, allowing
for dependent predictions. This is important for coping with general loss
functions. For a data set $(x_1, c_1), \ldots, (x_N, c_N)$ of structured prediction
examples, we write $T_n$ for the length of the longest search path on example
$n$, and $T_{\max} = \max_n T_n$.



3 The Searn Algorithm 

There are several vital ingredients in any application of Searn: a search
space for decomposing the prediction problem; a cost sensitive learning
algorithm; labeled structured prediction training data; a known loss function
for the structured prediction problem; and a good initial policy. These
aspects are described in more detail below.

A search space S. The choice of search space plays a role similar to the 
choice of structured decomposition in other algorithms. Final elements 
of the search space can always be referenced by a sequence of choices $y$. In
simple applications of Searn the search space is concrete. For example, 
it might consist of the parts of speech of each individual word in a 
sentence. In general, the search space can be abstract, and we show this 
can be beneficial experimentally. An abstract search space comes with 
an (unlearned) function f(y) which turns any sequence of predictions 
in the abstract search space into an output of the correct form. (For 
a concrete search space, $f$ is just the identity function. To minimize
confusion, we will leave off $f$ in future notation unless its presence is
specifically important.) 

A cost sensitive learning algorithm A. The learning algorithm returns a mul- 
ticlass classifier h(s) given cost sensitive training data. Here s is a de- 
scription of the location in the search space. A reduction of cost sen- 
sitive classification to binary classification [4] reduces the requirement 
to a binary learning algorithm. Searn relies upon this learning algo- 
rithm to form good generalizations. Nothing else in the Searn algorithm 
attempts to achieve generalization or estimation. The performance of 
Searn is strongly dependent upon how capable the learned classifier is. 
We call the learned classifier a policy because it is used multiple times
on inputs which it affects, just as in reinforcement learning.
Labeled structured prediction training data. Searn digests the labeled
training data for the structured prediction problem into cost-sensitive
training data which is fed into the cost-sensitive learning algorithm (see footnote 2).

A known loss function. A loss function $L(y, f(\hat{y}))$ must be known and be
computable for any sequence of predictions.

A good initial policy. This policy should achieve low loss when applied to 
the training data. This can (but need not always) be defined using a 
search algorithm. 

3.1 Searn at Test Time 

Searn at test time is a very simple algorithm. It uses the policy returned
by the learning algorithm to construct a sequence of decisions $y$ and makes
a final prediction $f(y)$. First, one uses the learned policy to compute $y_0$ on
the basis of just the input $x$. One then computes $y_1$ on the basis of $x$ and
$y_0$, followed by predicting $y_2$ on the basis of $x$, $y_0$ and $y_1$, etc. Finally, one
predicts $y_T$ on the basis of the input $x$ and all previous decisions.
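
The test-time procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `searn_predict` and `toy_policy`, the state encoding, and the hand-written tagging rules are all assumptions made for the example, and `toy_policy` merely stands in for a learned classifier.

```python
from typing import Callable, List, Sequence, Tuple

# A state is the input plus the decisions made so far (hypothetical encoding).
State = Tuple[Sequence[str], Tuple[str, ...]]


def searn_predict(x: Sequence[str],
                  policy: Callable[[State], str],
                  num_steps: int,
                  f: Callable[[List[str]], object] = lambda y: y) -> object:
    """Greedy decoding: call the policy once per step, feeding back its own outputs."""
    decisions: List[str] = []
    for _ in range(num_steps):
        state = (x, tuple(decisions))    # the input x and all previous decisions
        decisions.append(policy(state))  # next decision
    return f(decisions)                  # f is the identity for a concrete search space


def toy_policy(state: State) -> str:
    """Hand-written stand-in for a learned classifier (illustration only)."""
    x, prev = state
    word = x[len(prev)]
    if prev and prev[-1] == "Det":
        return "Noun"
    return "Det" if word.lower() in {"the", "a"} else "Verb"


print(searn_predict(["the", "dog", "barks"], toy_policy, num_steps=3))
# prints ['Det', 'Noun', 'Verb']
```

Note that the second decision depends on the first one, which is the "dependent predictions" property the text relies on.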

3.2 Searn at Train Time 

Searn operates in an iterative fashion. At each iteration it uses a known 
policy to create new cost-sensitive classification examples. These examples 
are essentially the classification decisions that a policy would need to get 
right in order to perform search well. These are used to learn a new classifier, 
which is interpreted as a new policy. This new policy is interpolated with 
the old policy and the process repeats. 

3.2.1 Initial Policy Searn relies on a good initial policy on the training 
data. This policy can take full advantage of the training data labels. The 
initial policy needs to be efficiently computable for Searn to be efficient. 
The implications of this assumption are discussed in detail in Section 3.4.1, 
but it is strictly weaker than assumptions made by other structured prediction
techniques. The initial policy we use is a policy that, for a given state,
predicts the best action to take with respect to the labels:

Definition 2 (Initial Policy) For an input $x$ and a cost vector $c$ as in
Def 1, and a state $s = x \times (y_1, \ldots, y_t)$ in the search space, the initial policy
$\pi(s, c)$ is $\arg\min_{y_{t+1}} \min_{y_{t+2}, \ldots, y_T} c_{\langle y_1, \ldots, y_T \rangle}$. That is, $\pi$ chooses the action
(i.e., value for $y_{t+1}$) that minimizes the corresponding cost, assuming that
all future decisions are also made optimally.

This choice of initial policy is optimal when the correct output is a 
deterministic function of the input features (effectively in a noise-free envi- 
ronment). 
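
As an illustration of Definition 2, the following sketch computes the initial policy by exhaustively enumerating completions. It is only meant to make the nested minimization concrete: the function name `initial_policy`, the representation of the cost vector as a Python function over complete outputs, and the toy Hamming cost are assumptions, and the enumeration is exponential in the number of remaining decisions.

```python
from itertools import product


def initial_policy(prefix, cost, actions, T):
    """Return the next action whose best possible completion has minimal cost.

    prefix  -- decisions y_1..y_t made so far
    cost    -- function mapping a complete output (tuple of length T) to its cost c_y
    actions -- the candidate values for each decision
    """
    assert len(prefix) < T, "no decision left to make"
    remaining = T - len(prefix) - 1
    best_action, best_cost = None, float("inf")
    for a in actions:                                   # outer arg min over y_{t+1}
        for tail in product(actions, repeat=remaining):  # inner min over y_{t+2..T}
            c = cost(tuple(prefix) + (a,) + tail)
            if c < best_cost:
                best_action, best_cost = a, c
    return best_action


# Toy cost: Hamming distance to an assumed "true" tag sequence.
truth = ("Det", "Noun", "Verb")
actions = ("Det", "Noun", "Verb")


def hamming(y):
    return sum(p != t for p, t in zip(y, truth))


# Even after a wrong first decision, the policy picks the best next step.
print(initial_policy(("Noun",), hamming, actions, T=3))  # prints Noun
```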

2 A $k$-class cost-sensitive example is given by an input $X$ and a vector of costs
$c \in (\mathbb{R}^+)^k$. Each class $i$ has an associated cost $c_i$ and the goal is a function
$h : X \mapsto i$ that minimizes the expected value of $c_i$. See [4].
3.2.2 Cost-sensitive Examples In the training phase, Searn uses a given
policy $h$ (initialized to the initial policy $\pi$) to construct cost-sensitive
multiclass classification examples from which a new classifier is learned. 
These classification examples are created by running the given policy h over 
the training data. This generates one path per structured training example. 
Searn creates a single cost-sensitive example for each state on each path. 
The classes associated with each example are the available actions (or next 
states). The only difficulty lies in specifying the costs. 

The cost associated with taking an action that leads to state s is the 
regret associated with this action, given our current policy. For each state 
s and each action a, we take action a and then execute the policy to gain 
a full sequence of predictions $y$ for which we can compute a loss $c_y$. Of
all the possible actions, one, $a'$, has the minimum expected loss. The cost
$\ell_h(c, s, a)$ for an action $a$ in state $s$ is the difference in loss between taking
action $a$ and taking the action $a'$; see Eq (2).

$\ell_h(c, s, a) = E_{y \sim (s, a, h)} c_y - \min_{a'} E_{y \sim (s, a', h)} c_y$ (2)
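
For a deterministic policy the expectations in Eq (2) reduce to a single rollout per action, which makes the cost computation easy to sketch. The helper names `rollout` and `regret_costs`, the binary label set, and the oracle policy below are assumptions chosen for illustration only.

```python
def rollout(prefix, policy, T):
    """Complete a partial output by repeatedly applying the policy."""
    y = list(prefix)
    while len(y) < T:
        y.append(policy(tuple(y)))
    return tuple(y)


def regret_costs(prefix, actions, policy, loss, T):
    """Cost of each action = loss after taking it and then following the policy,
    minus the loss of the best action (so the best action has cost 0)."""
    losses = {a: loss(rollout(tuple(prefix) + (a,), policy, T)) for a in actions}
    best = min(losses.values())
    return {a: l - best for a, l in losses.items()}


# Toy setting: binary labels, Hamming loss against an assumed true output.
truth = (1, 0, 1)


def hamming(y):
    return sum(p != t for p, t in zip(y, truth))


def oracle(y):
    # Deterministic policy that copies the true label at the next position.
    return truth[len(y)]


print(regret_costs((), (0, 1), oracle, hamming, T=3))  # prints {0: 1, 1: 0}
```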

One complication arises because the policy used may be stochastic. This 
can occur even when the base classifier learned is deterministic due to 
stochastic interpolation within Searn. There are (at least) three possible 
ways to deal with randomness. 

1. Monte-Carlo sampling: one draws many paths according to $h$ beginning
at $s'$ and averages over the costs.

2. Single Monte-Carlo sampling: draw a single path and use the correspond- 
ing cost, with tied randomization as per Pegasus [42]. 

3. Approximation: it is often possible to efficiently compute the loss as- 
sociated with following the initial policy from a given state; when h 
is sufficiently good, this may serve as a useful and fast approximation. 
(This is also the approach described by [30].) 
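
The first two options can be sketched with one helper that differs only in the number of sampled paths. This is a hedged illustration: `mc_expected_loss`, the noisy toy policy, and the convention that a stochastic policy receives a random-number generator are assumptions, and the tied randomization of Pegasus [42] is not modeled here.

```python
import random


def mc_expected_loss(prefix, policy, loss, T, n_samples, rng):
    """Estimate the expected loss of following a stochastic policy from a prefix.

    n_samples > 1 corresponds to option 1 (Monte-Carlo sampling);
    n_samples == 1 corresponds to option 2 (a single sampled path).
    """
    total = 0.0
    for _ in range(n_samples):
        y = list(prefix)
        while len(y) < T:
            y.append(policy(tuple(y), rng))
        total += loss(tuple(y))
    return total / n_samples


truth = (1, 0, 1)


def hamming(y):
    return sum(p != t for p, t in zip(y, truth))


def noisy_policy(y, rng):
    # Predicts the true label with probability 0.9, the wrong one otherwise.
    correct = truth[len(y)]
    return correct if rng.random() < 0.9 else 1 - correct


rng = random.Random(0)
many = mc_expected_loss((0,), noisy_policy, hamming, 3, n_samples=200, rng=rng)
single = mc_expected_loss((0,), noisy_policy, hamming, 3, n_samples=1, rng=rng)
print(many, single)
```

Option 3 would replace the sampled rollouts by the loss of completing the prefix with the initial policy, which needs no sampling at all.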

The quality of the learned solution depends on the quality of the
approximation of the loss. Obtaining Monte-Carlo samples is likely the best
solution, but in many cases the approximation is sufficient. An empirical
comparison of these options is performed in [12]. Here it is observed that
for easy problems (ones for which low loss is possible), the approximation
performs approximately as well as the alternatives. Moreover, the
approximation typically outperforms the single sample approach, likely due to the
noise induced by following a single sample.

3.2.3 Algorithm The Searn algorithm is shown in Figure 1. As input, the 
algorithm takes a structured learning data set, an initial policy $\pi$ and a
multiclass cost sensitive learner A. Searn operates iteratively, maintaining 
a current policy hypothesis h at each iteration. This hypothesis is initialized 
to the initial policy (step 1). 
Algorithm Searn($S^{SP}$, $\pi$, $A$)
1: Initialize policy $h \leftarrow \pi$
2: while $h$ has a significant dependence on $\pi$ do
3:   Initialize the set of cost-sensitive examples $S \leftarrow \emptyset$
4:   for $(x, y) \in S^{SP}$ do
5:     Compute predictions under the current policy $\hat{y} \sim x, h$
6:     for $t = 1 \ldots T_n$ do
7:       Compute features $\Phi = \Phi(s_t)$ for state $s_t = (x, y_1, \ldots, y_t)$
8:       Initialize a cost vector $c = \langle \rangle$
9:       for each possible action $a$ do
10:        Let the cost $\ell_a$ for example $x, c$ at state $s$ be $\ell_h(c, s, a)$
11:      end for
12:      Add cost-sensitive example $(\Phi, \ell)$ to $S$
13:    end for
14:  end for
15:  Learn a classifier on $S$: $h' \leftarrow A(S)$
16:  Interpolate: $h \leftarrow \beta h' + (1 - \beta) h$
17: end while
18: return $h_{\mathrm{last}}$ without $\pi$



Fig. 1 Complete Searn Algorithm 



The algorithm then loops for a number of iterations. In each iteration, 
it creates a (multi-)set of cost-sensitive examples, S. These are created by 
looping over each structured example (step 4). For each example (step 5), 
the current policy $h$ is used to produce a full output, represented as a sequence
of predictions $y_1, \ldots, y_{T_n}$. From this, states are derived and used to
create a single cost-sensitive example (steps 6-14) at each timestep.

The first task in creating a cost-sensitive example is to compute the 
associated feature vector, performed in step 7. This feature vector is based 
on the current state $s_t$ which includes the features $x$ (the creation of the
feature vectors is discussed in more detail in Section 3.3). The cost vector
contains one entry for every possible action $a$ that can be executed from
state $s_t$. For each action $a$, we compute the expected loss associated with
the state $s_t \oplus a$: the state arrived at assuming we take action $a$ (step 10).

Searn creates a large set of cost-sensitive examples S. These are fed into 
any cost-sensitive classification algorithm, A, to produce a new classifier h' 
(step 15). In step 16, Searn combines the newly learned classifier h' with 
the current classifier h to produce a new classifier. This combination is 
performed through stochastic interpolation with interpolation parameter $\beta$
(see Section 4 for details). The meaning of stochastic interpolation here is:
"every time $h$ is evaluated, a new random number is drawn. If the random
number is less than $\beta$ then $h'$ is used and otherwise the old $h$ is used."
Searn returns the final policy with $\pi$ removed (step 18) and the stochastic
interpolation renormalized.
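
Stochastic interpolation and the final renormalization can be represented as a weighted mixture of component policies. The sketch below is one possible encoding, not the authors' code: the class name `InterpolatedPolicy` and its methods are hypothetical, and drawing a component by weight on every call is equivalent in distribution to the nested "draw a random number, compare with the interpolation parameter" description.

```python
import random


class InterpolatedPolicy:
    """Mixture policy: every call draws one component at random by weight."""

    def __init__(self, components, weights, rng=None):
        self.components = list(components)
        self.weights = list(weights)
        self.rng = rng or random.Random()

    def __call__(self, state):
        component = self.rng.choices(self.components, weights=self.weights, k=1)[0]
        return component(state)

    def interpolate(self, new_classifier, beta):
        """Step 16: h <- beta * h' + (1 - beta) * h."""
        weights = [(1.0 - beta) * w for w in self.weights] + [beta]
        return InterpolatedPolicy(self.components + [new_classifier], weights, self.rng)

    def without_first(self):
        """Step 18: drop the first component (the initial policy) and renormalize."""
        rest = self.weights[1:]
        total = sum(rest)
        return InterpolatedPolicy(self.components[1:], [w / total for w in rest], self.rng)


# Toy components that just report their own name.
pi = lambda s: "pi"
h1 = lambda s: "h1"
h2 = lambda s: "h2"

h = InterpolatedPolicy([pi], [1.0], random.Random(0))
h = h.interpolate(h1, beta=0.25)   # weights: [0.75, 0.25]
h = h.interpolate(h2, beta=0.25)   # weights: [0.5625, 0.1875, 0.25]
final = h.without_first()          # weights: [3/7, 4/7]
print(h.weights, final.weights)
```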
3.3 Feature Computations

In step 7 of the Searn algorithm (Figure 1), one is required to compute
a feature vector $\Phi$ on the basis of the given state $s_t$. In theory, this step is
arbitrary. However, the performance of the underlying classification algorithm
(and hence the induced structured prediction algorithm) hinges on
a good choice for these features. The feature vector $\Phi(s_t)$ may depend on
any aspect of the input x and any past decision. In particular, there is no 
limitation to a "Markov" dependence on previous decisions. 

For concreteness, consider the part-of-speech tagging task: for each word
in a sentence, we must assign a single part of speech (e.g., Det, Noun, Verb,
etc.). Given a state $s_t = (x, y_1, \ldots, y_t)$, one might compute a sparse feature
vector $\Phi(s_t)$ with zeros everywhere except at positions corresponding to
"interesting" aspects of the input. For instance, a feature corresponding to
the identity of the $(t+1)$st word in the sentence would likely be very important
(since this is the word to be tagged). Furthermore, a feature corresponding
to the value $y_t$ would likely be important, since we believe that subsequent
tags are not independent of previous tags. These features would serve as
the input to the cost-sensitive learning algorithm, which would attempt to
predict the correct label for the $(t+1)$st word. This usually corresponds to
learning a single weight vector for each class (in a one-versus-all setting) or 
to learning a single weight vector for each pair of classes (for all-pairs). 
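
A sparse feature vector of this kind is often stored as a dictionary from feature names to values. The sketch below is an assumed encoding for illustration: the function `pos_features`, the feature-name strings, and the particular non-Markov feature are not taken from the paper.

```python
def pos_features(x, prev_tags):
    """Sparse features for tagging word t+1, given the sentence and tags y_1..y_t."""
    t = len(prev_tags)
    feats = {}
    # Identity of the word about to be tagged.
    feats["word=" + x[t].lower()] = 1.0
    # Most recent decision y_t ("<s>" marks the start of the sentence).
    feats["prev_tag=" + (prev_tags[-1] if prev_tags else "<s>")] = 1.0
    # A non-Markov feature: it may look at any past decision, not only the last one.
    feats["seen_verb=" + str("Verb" in prev_tags)] = 1.0
    return feats


print(pos_features(["The", "dog", "barks"], ["Det", "Noun"]))
# prints {'word=barks': 1.0, 'prev_tag=Noun': 1.0, 'seen_verb=False': 1.0}
```

All names absent from the dictionary are implicitly zero, which matches the "zeros everywhere except at interesting positions" description.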

3.4 Policies 

Searn functions in terms of policies, a notion borrowed from the field of 
reinforcement learning. This section discusses the nature of the initial policy 
assumption and the connections to reinforcement learning. 

3.4.1 Computability of the Initial Policy Searn relies upon the ability to
start with a good initial policy $\pi$, defined formally in Definition 2. For
many simple problems under standard loss functions, it is straightforward
to compute a good policy $\pi$ in constant time. For instance, consider the
sequence labeling problem (discussed further in Section 6.1). A standard
loss function used in this task is Hamming loss: of all possible positions,
how many does our model predict incorrectly. If one performs search
left-to-right, labeling one element at a time (i.e., each element of the $y$ vector
corresponds exactly to one label), then $\pi$ is trivial to compute. Given the
correct label sequence, $\pi$ simply chooses at position $i$ the correct label at
position $i$. However, Searn is not limited to simple Hamming loss. A more
complex loss function often considered for the sequence segmentation task 
is F-score over (correctly labeled) segments. As discussed in Section 6.1.3, 
it is just as easy to compute a good initial policy for this loss function. 
This is not possible in many other frameworks, due to the non-additivity of 
F-score. This is independent of the features. 
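
For the Hamming-loss case, the constant-time initial policy and an exhaustive reference can be compared directly. This is a small sanity-check sketch under assumed names (`pi_hamming`, `complete`, `best_achievable`) and an assumed three-label tag set; it does not cover the F-score policy of Section 6.1.3.

```python
from itertools import product


def hamming(y, truth):
    return sum(p != t for p, t in zip(y, truth))


def pi_hamming(prefix, truth):
    """Initial policy under Hamming loss: copy the true label at the next position."""
    return truth[len(prefix)]


def complete(prefix, truth):
    """Follow pi_hamming from the given prefix until the output is complete."""
    y = list(prefix)
    while len(y) < len(truth):
        y.append(pi_hamming(y, truth))
    return tuple(y)


def best_achievable(prefix, truth, labels):
    """Exhaustive reference: minimum loss over all completions (exponential time)."""
    k = len(truth) - len(prefix)
    return min(hamming(tuple(prefix) + tail, truth)
               for tail in product(labels, repeat=k))


truth = ("B", "I", "O")
labels = ("B", "I", "O")
prefix = ("O",)  # one mistake already made
print(hamming(complete(prefix, truth), truth), best_achievable(prefix, truth, labels))
# prints 1 1
```

From any prefix, following the constant-time policy reaches the minimum achievable Hamming loss, because past mistakes cannot be undone and no new ones are added.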
This result, that Searn can learn under strictly more complex structures
and loss functions than other techniques, is not limited to sequence
labeling, as demonstrated below in Theorem 1. In order to prove this, we
need to formalize what we consider as "other techniques." We use the
max-margin Markov network ($M^3N$) formalism [51] for comparison, since this
currently appears to be the most powerful generic framework. In particular,
learning in $M^3N$s is often tractable for problems that would be #P-hard for
conditional random fields. The $M^3N$ has several components, one of which
is the ability to compute a loss-augmented minimization [51]. This requirement
states that Eq (3) is computable for any input $x$, output set $\mathcal{Y}_x$, true
output $y$ and weight vector $w$.

$\mathrm{opt}(\mathcal{Y}_x, y, w) = \arg\max_{\hat{y} \in \mathcal{Y}_x} w^\top \Phi(x, \hat{y}) - l(y, \hat{y})$ (3)

In Eq (3), $\Phi(\cdot)$ produces a vector of features, $w$ is a weight vector and
$l(y, \hat{y})$ is the loss for prediction $\hat{y}$ when the correct output is $y$.

Theorem 1 Suppose Eq (3) is computable in time $T(x)$; then the optimal
policy is computable in time $O(T(x))$. Further, there exist problems
for which the optimal policy is computable in constant time and for which 
Eq (3) is an NP-hard computation. 

Proof (sketch) For the first part, we use a vector encoding of $y$ that maintains
the decomposition over the regions used by the $M^3N$. Given a prefix
$y_1, \ldots, y_t$, solve opt on the future choices (i.e., remove the part of the structure
corresponding to the first $t$ outputs), which gives us an optimal policy.

For the second part, we simply make $\Phi$ complex: for instance, include
long-range dependencies in sequence labeling. As the Markov order $k$ increases,
the complexity of Viterbi decoding grows as $l^k$, where $l$ is the number
of labels. In the limit as the Markov order approaches the length of the
longest sequence, $T_{\max}$, the computation for the minimal cost path (with
or without the added complexity of augmenting the cost with the loss) becomes
NP-hard. Despite this intractability for Viterbi decoding, Searn can
be applied to the identical problem with the exact same feature set, and inference
becomes tractable (precisely because Searn never applies a Viterbi
algorithm). The complexity of one iteration of Searn for this problem is
identical to the case when a Markov assumption is made: it is $O(T l b)$, where
$T$ is the length of the sequence, and $b$ is the beam size.

3.4.2 Search-based Policies The Searn algorithm and the theory to be
presented in Section 4 do not require that the initial policy be optimal. 
Searn can train against any policy. One artifact of this observation is that 
we can use search to create the initial policy. 

At any step of Searn, we need to be able to compute the best next 
action. That is, given a node in the search space, and the cost vector c, we 
need to compute the best step to take. This is exactly the standard search 
problem: given a node in a search space, we find the shortest path to a 
goal. By taking the first step along this shortest path, we obtain a good
initial policy (assuming this shortest path is, indeed, shortest). This means 
that when Searn asks for the best next step, one can execute any standard 
search algorithm to compute this, for cases where a good initial policy is 
not available analytically. 

Given this observation, the requirements of Searn are reduced: instead 
of requiring a good initial policy, we simply require that one can perform 
efficient approximate search. 

3.4.3 Beyond Greedy Search We have presented Searn as an algorithm
that mimics the operations of a greedy search algorithm. Real-world experience
has shown that often greedy search is insufficient and more complex
search algorithms are required. This observation is consistent with the standard
view of search (trying to find a shortest path), but nebulous when considered
in the context of Searn. Nevertheless, it is often desirable to allow
a model to trade off past decisions against future decisions, and this is precisely the
purpose of instituting more complex search algorithms.

It turns out that any (non-greedy) search algorithm operating in a search 
space S can be equivalently viewed as a greedy search algorithm operating in
an abstract space S* (where the structure of the abstract space is dependent 
on the original search algorithm). In a general search algorithm [47], one 
maintains a queue of active states and expands a single state in each search 
step. After expansion, each resulting child state is enqueued. The ordering 
(and, perhaps, maximal size) of the queue is determined by the specific 
search algorithm. 

In order to simulate this more complex algorithm as greedy search, we
construct the abstract space $S^*$ as follows. Each node $s \in S^*$ represents a
state of the queue. A transition exists between $s$ and $s'$ in $S^*$ exactly when
a particular expansion of an $S$-node in the $s$-queue results in the queue
becoming $s'$. Finally, for each goal state $g \in S$, we augment $S^*$ with a single
unique goal state $g^*$. We insert transitions from $s \in S^*$ to $g^*$ exactly when
$g \in s$. Thus, in order to complete the search process, a goal node must be
in the queue and the search algorithm must select this single node.

In general, Searn makes no assumptions about how the search process is 
structured. A different search process leads to a different bias in the learning 
algorithm. It is up to the designer to construct a search process so that (a) 
a good bias is exhibited and (b) computing a good initial policy is easy. For 
instance, for some combinatorial problems such as matchings or tours, it 
is known that left-to-right beam search tends to perform poorly. For these 
problems, a local hill-climbing search is likely to be more effective since we 
expect it to render the underlying classification problem simpler. 

4 Theoretical Analysis 

Searn functions by slowly moving away from the initial policy (which is 
available only for the training data) toward a fully learned policy. Each 
iteration of Searn degrades the current policy. The main theorem states
that the learned policy is not much worse than the starting (optimal) policy 
plus a term related to the average cost sensitive loss of the learned classifiers 
and another term related to the maximum cost sensitive loss. To simplify 
notation, we write T for T max . 

It is important in the analysis to refer explicitly to the error of the
classifiers learned during the Searn process. Let Searn($\mathcal{D}$, $h$) denote the
distribution over classification problems generated by running Searn with
policy $h$ on distribution $\mathcal{D}$. Also let $\ell_h^{CS}(h')$ denote the loss of classifier $h'$
on the distribution Searn($\mathcal{D}$, $h$). Let the average cost sensitive loss over $I$
iterations be:

$\ell_{\mathrm{avg}} = \frac{1}{I} \sum_{i=1}^{I} \ell_{h_i}^{CS}(h'_i)$ (4)

where $h_i$ is the $i$th policy and $h'_i$ is the classifier learned on the $i$th iteration.

Theorem 2 For all $\mathcal{D}$ with $c_{\max} = E_{(x,c) \sim \mathcal{D}} \max_y c_y$ (with $(x, c)$ as in
Def 1), for all learned cost sensitive classifiers $h'$, Searn with $\beta = 1/T^3$
and $2T^3 \ln T$ iterations outputs a learned policy with loss bounded by:

$L(\mathcal{D}, h_{\mathrm{last}}) \le L(\mathcal{D}, \pi) + 2T \ell_{\mathrm{avg}} \ln T + (1 + \ln T) c_{\max} / T$

The dependence on $T$ in the second term is due to the cost sensitive
loss being an average over $T$ timesteps while the total loss is a sum. The
$\ln T$ factor is not essential and can be removed using other approaches [3,
30]. The advantage of the theorem here is that it applies to an algorithm
that naturally copes with variable length $T$ and yields a smaller amount of
computation in practice.

The choices of $\beta$ and the number of iterations are pessimistic in practice.
Empirically, we use a development set to perform a line search minimization
to find per-iteration values for $\beta$ and to decide when to stop iterating. The
analytical choice of $\beta$ is made to ensure that the probability that the newly
created policy only makes one different choice from the previous policy for
any given example is sufficiently low. The choice of $\beta$ assumes the worst:
the newly learned classifier always disagrees with the previous policy. In
practice, this rarely happens. After the first iteration, the learned policy
is typically quite good and only rarely differs from the initial policy. So
choosing such a small value for $\beta$ is unnecessary: even with a higher value,
the current classifier often agrees with the previous policy.

The proof rests on the following lemmas.

Lemma 1 (Policy Degradation) Given a policy $h$ with loss $L(\mathcal{D}, h)$, apply
a single iteration of Searn to learn a classifier $h'$ with cost-sensitive
loss $\ell_h^{CS}(h')$. Create a new policy $h^{\mathrm{new}}$ by interpolation with parameter
$\beta \in (0, 1/T)$. Then, for all $\mathcal{D}$, with $c_{\max} = E_{(x,c) \sim \mathcal{D}} \max_i c_i$ (with $(x, c)$ as in
Def 1):

$L(\mathcal{D}, h^{\mathrm{new}}) \le L(\mathcal{D}, h) + T \beta \ell_h^{CS}(h') + \frac{1}{2} \beta^2 T^2 c_{\max}$ (5)

Proof The proof largely follows the proofs of Lem 6.1 and Theorem 4.1 for 
conservative policy iteration [23]. The three differences are that (1) we must 
deal with the finite horizon case; (2) we move away from rather than toward 
a good policy; and (3) we expand to higher order. 

The proof works by separating three cases depending on whether $h^{CS}$ (the
newly learned classifier $h'$) or $h$ is called in the process of running
$h^{\mathrm{new}}$. The easiest case is when $h^{CS}$
is never called. The second case is when it is called exactly once. The final
case is when it is called more than once. Denote these three events by $c = 0$,
$c = 1$ and $c \ge 2$, respectively.



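$L(\mathcal{D}, h^{\mathrm{new}}) = \Pr(c = 0) L(\mathcal{D}, h^{\mathrm{new}} \mid c = 0) + \Pr(c = 1) L(\mathcal{D}, h^{\mathrm{new}} \mid c = 1) + \Pr(c \ge 2) L(\mathcal{D}, h^{\mathrm{new}} \mid c \ge 2)$ (6)

$\le (1 - \beta)^T L(\mathcal{D}, h) + T \beta (1 - \beta)^{T-1} [L(\mathcal{D}, h) + \ell_h^{CS}(h')] + [1 - (1 - \beta)^T - T \beta (1 - \beta)^{T-1}] c_{\max}$ (7)

$= L(\mathcal{D}, h) + T \beta (1 - \beta)^{T-1} \ell_h^{CS}(h') + [1 - (1 - \beta)^T - T \beta (1 - \beta)^{T-1}] (c_{\max} - L(\mathcal{D}, h))$ (8)

$\le L(\mathcal{D}, h) + T \beta \ell_h^{CS}(h') + [1 - (1 - \beta)^T - T \beta (1 - \beta)^{T-1}] c_{\max}$ (9)

$\le L(\mathcal{D}, h) + T \beta \ell_h^{CS}(h') + \frac{1}{2} T^2 \beta^2 c_{\max}$ (11)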



The first inequality writes out the precise probability of the events in
terms of $\beta$ and bounds the loss of the last event ($c \ge 2$) by $c_{\max}$. The second
inequality is algebraic. The third uses the assumption that $\beta < 1/T$.
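
One way to see the last step is sketched below. This is an added justification and not necessarily the authors' own route: it treats the $T$ interpolation draws as independent events of probability $\beta$, so the bracketed factor is the probability that the learned classifier is called at least twice, and a union bound over pairs of timesteps gives the quadratic term.

```latex
\begin{align*}
\Pr(c \ge 2) &= 1 - (1-\beta)^T - T\beta(1-\beta)^{T-1} \\
             &\le \binom{T}{2}\beta^2
              = \frac{T(T-1)}{2}\,\beta^2
              \le \frac{1}{2}\,T^2\beta^2 .
\end{align*}
```

Under this reading, the assumption $\beta < 1/T$ keeps the resulting term below $c_{\max}/2$, so the bound stays informative.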



This lemma states that applying a single iteration of Searn does not cause 
the structured prediction loss of the learned hypothesis to degrade too much. 
In particular, up to a first order approximation, the loss increases proportional
to the loss of the learned classifier. This observation can be iterated
to yield the following lemma: 

Lemma 2 (Iteration) For all $\mathcal{D}$, for all learned $h'$, after $C/\beta$ iterations
of Searn beginning with a policy $\pi$ with loss $L(\mathcal{D}, \pi)$, and average learned
losses as Eq (4), the loss of the final learned policy $h$ (without the optimal
policy component) is bounded by Eq (12).

$L(\mathcal{D}, h) \le L(\mathcal{D}, \pi) + C T \ell_{\mathrm{avg}} + c_{\max} \left( \frac{1}{2} C T^2 \beta + T \exp[-C] \right)$ (12)

This lemma states that after $C/\beta$ iterations of Searn the learned policy
is not much worse than the quality of the initial policy $\pi$. The theorem
follows from a choice of the constants $\beta$ and $C$ in Lemma 2.

Proof The proof involves invoking Lemma 1 $C/\beta$ times. The second and the
third terms sum to give the following:

$L(\mathcal{D}, h) \le L(\mathcal{D}, \pi) + C T \ell_{\mathrm{avg}} + \frac{1}{2} C T^2 \beta c_{\max}$

Last, if we call the initial policy, we fail with loss at most $c_{\max}$. The probability
of failure after $C/\beta$ iterations is at most $T (1 - \beta)^{C/\beta} \le T \exp[-C]$.
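
The step from Lemma 2 to Theorem 2 can be checked by substituting the constants. The arithmetic below is a sketch that assumes the Lemma 2 bound is the sum of the terms named in its proof: $C T \ell_{\mathrm{avg}}$ from the second term of Lemma 1, $\frac{1}{2} C T^2 \beta c_{\max}$ from the third, and $T \exp[-C] c_{\max}$ from the failure event.

```latex
\begin{align*}
\beta = \frac{1}{T^3}, \quad \frac{C}{\beta} = 2T^3 \ln T
  \;&\Longrightarrow\; C = 2 \ln T \\
C\,T\,\ell_{\mathrm{avg}} &= 2T\,\ell_{\mathrm{avg}} \ln T \\
\tfrac{1}{2}\,C\,T^2\,\beta\,c_{\max} &= \frac{\ln T}{T}\,c_{\max} \\
T \exp[-C]\,c_{\max} &= \frac{T}{T^2}\,c_{\max} = \frac{c_{\max}}{T}
\end{align*}
```

Adding the three terms to $L(\mathcal{D}, \pi)$ gives $L(\mathcal{D}, \pi) + 2T \ell_{\mathrm{avg}} \ln T + (1 + \ln T) c_{\max} / T$, which matches the bound stated in Theorem 2.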

5 Comparison to Alternative Techniques 

Standard techniques for structured prediction focus on the case where the
argmax in Eq (13) is tractable. Given its tractability, they attempt to
learn parameters $\theta$ such that solving Eq (13) often results in low loss. There
are a handful of classes of such algorithms and a large number of variants
of each. Here, we focus on independent classifier models, perceptron-based
models, and global models (such as conditional random fields and max-margin
Markov networks). There are, of course, alternative frameworks (see,
e.g., [58,36,1,39,54]), but these are common examples.

5.1 The argmax Problem 

Many structured prediction problems construct a scoring function $F(y \mid x, \theta)$.
For a given input $x \in \mathcal{X}$ and set of parameters $\theta \in \Theta$, $F$ provides a score for
each possible output $y$. This leads to the "argmax" problem (also known
as the decoding problem or the pre-image problem), which seeks to find the
$y$ that maximizes $F$ in order to make a prediction.




$\hat{y} = \arg\max_{y \in \mathcal{Y}_x} F(y \mid x, \theta)$ (13)

In Eq (13), we seek the output $y$ from the set $\mathcal{Y}_x$ (where $\mathcal{Y}_x \subseteq \mathcal{Y}$ is
the set of all "reasonable" outputs for the input $x$, typically assumed to
be finite). Unfortunately, solving Eq (13) exactly is tractable only for very
particular structures $\mathcal{Y}$ and scoring functions $F$. As an easy example, when
$\mathcal{Y}_x$ is interpreted as a label sequence and the score function $F$ depends only
on adjacent labels, then dynamic programming can be used, leading to an
$O(nk^2)$ prediction algorithm, where $n$ is the length of the sequence and $k$ is
the number of possible labels for each element in the sequence. Similarly, if
$\mathcal{Y}$ represents trees and $F$ obeys a context-free assumption, then this problem
can be solved in time $O(n^3 k)$.
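
The sequence case can be illustrated with a standard Viterbi-style dynamic program. The sketch below assumes the score decomposes as a sum of terms `score(t, prev, cur)` over adjacent labels; the function names and the toy scoring function are hypothetical. The two nested loops over labels at each of the $n$ positions give the $O(nk^2)$ cost.

```python
def viterbi(n, labels, score):
    """Exact argmax over label sequences of length n (n >= 1).

    score(t, prev, cur) is the contribution of label `cur` at position t
    given the previous label `prev` (prev is None at t = 0).
    """
    best = {y: score(0, None, y) for y in labels}  # best score of a path ending in y
    back = []                                      # back-pointers, one dict per step
    for t in range(1, n):
        new_best, ptr = {}, {}
        for cur in labels:
            prev = max(labels, key=lambda p: best[p] + score(t, p, cur))
            new_best[cur] = best[prev] + score(t, prev, cur)
            ptr[cur] = prev
        best = new_best
        back.append(ptr)
    last = max(labels, key=lambda y: best[y])
    path = [last]
    for ptr in reversed(back):                     # follow back-pointers to the start
        path.append(ptr[path[-1]])
    return list(reversed(path)), best[last]


# Toy scores: per-position preferences plus a small bonus for repeating a label.
emit = [{"A": 1.0, "B": 0.0}, {"A": 0.0, "B": 1.0}, {"A": 1.0, "B": 0.0}]


def toy_score(t, prev, cur):
    return emit[t][cur] + (0.25 if prev == cur else 0.0)


print(viterbi(3, ("A", "B"), toy_score))  # prints (['A', 'B', 'A'], 3.0)
```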

Often we are interested in more complex structures, more complex features
or both. For such tasks, an exact solution to Eq (13) is not tractable.
For example, in natural language processing most statistical word-based and
phrase-based models of translation are known to be NP-hard [19]; syntactic
translation models based on synchronous context free grammars are sometimes
polynomial, but with an exponent that is too large in practice, such as
$n^{11}$ [21]. Even in comparatively simple problems like sequence labeling and
parsing (which are only $O(n)$ or $O(n^3)$), it is often still computationally
prohibitive to perform exhaustive search [5]. For another sort of example,
in computational biology, most models for phylogeny [17] and protein
secondary structure prediction [10] result in NP-hard search problems.

When faced with such an intractable search problem, the standard tactic
is to use an approximate search algorithm, such as greedy search, beam 
search, local hill-climbing search, simulated annealing, etc. These search 
algorithms are unlikely to be provably optimal (since this would imply that 
one is efficiently solving an NP-hard problem), but the hope is that they 
perform well on problems that are observed in the real world, as opposed 
to "worst case" inputs. 

Unfortunately, applying suboptimal search algorithms to solve the struc- 
tured prediction problem from Eq (13) dispenses with many nice theoretical 
properties enjoyed by sophisticated learning algorithms. For instance, it may 
be possible to learn Bayes-optimal parameters such that if exact search 
were possible, one would always find the best output. But given that exact 
search is not possible, such properties go away. Moreover, given that dif- 
ferent search algorithms exhibit different properties and biases, it is easy 
to believe that the value of $\theta$ that is optimal for one search algorithm is
not the same as the value that is optimal for another search algorithm (see footnote 3). It
is these observations that have motivated our exploration of search-based 
structured prediction algorithms: learning algorithms for structured predic- 
tion that explicitly model the search process. 



3 In fact, [57] has provided evidence that when using approximation algorithms
for graphical models, it is important to use the same approximation at both
training and testing time.

5.2 Independent Classifiers

There are essentially two varieties of local classification techniques applied 
to structured prediction problems. In the first variety, the structure in the 
problem is ignored, and a single classifier is trained to predict each
element in the output vector independently [43] or with dependence created
by enforcement of membership in $\mathcal{Y}_x$ constraints [45]. The second
variety is typified by maximum entropy Markov models [37], though the basic
idea of MEMMs has also been applied more generally to SVMs [27,28,20]. In this
variety, the elements in the prediction vector are predicted sequentially, with
the $n$th element conditional on outputs $n-k, \dots, n-1$ for a $k$th order model.

In the purely independent classifier setting, both training and testing 
proceed in the obvious way. Since the classifiers make one decision com- 
pletely independently of any other decision, training makes use only of the 
input. This makes training the classifiers incredibly straightforward, and 
also makes prediction easy. In fact, running Searn with $\Phi(x, y)$ independent
of all but $y_n$ for the $n$th prediction would yield exactly this framework
(note that there would be no reason to iterate Searn in this case). While 
this renders the independent classifiers approach attractive, it is also signif- 
icantly weaker, in the sense that one cannot define complex features over 
the output space. This has not thus far hindered its applicability to prob- 
lems like sequence labeling [43] , parsing and semantic role labeling [44] , but 
does seem to be an overly strict condition. This also limits the approach to 
Hamming loss. 

Searn is more similar to the MEMM-esque prediction setting. The key 
difference is that in the MEMM, the nth prediction is being made on the ba- 
sis of the k previous predictions. However, these predictions are noisy, which 
potentially leads to the suboptimal performance described in the previous 
section. The essential problem is that the models have been trained assum- 
ing that they make all previous predictions correctly, but when applied in 
practice, they only have predictions about previous labels. It turns out that 
this can cause them to perform nearly arbitrarily badly. This is formalized 
in the following theorem, due to Matti Kääriäinen.

Theorem 3 ([22]) There exists a distribution $\mathcal{D}$ over first order binary
Markov problems such that training a binary classifier based on true previous
predictions to an error rate of $\epsilon > 0$ leads to a Hamming loss given in
Eq (14), where $T$ is the length of the sequence.

$$\frac{T}{2} - \frac{1 - (1 - 2\epsilon)^{T+1}}{4\epsilon} + \frac{1}{2} \approx \frac{T}{2} \qquad (14)$$

Where the approximation is true for small $\epsilon$ or large $T$.
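
To get a feel for the expression in Eq (14), the following sketch checks it numerically under one simple model. The model is an assumption made here for illustration and is not claimed to be the construction used in [22]: the correctness of each prediction flips, relative to the previous one, with probability `eps`, and a fictitious 0th prediction is correct. Summing the per-position error probabilities then gives the closed form $T/2 - (1 - (1 - 2\epsilon)^{T+1})/(4\epsilon) + 1/2$.

```python
def expected_hamming_loss_by_recursion(eps, T):
    """Sum of per-position error probabilities under the assumed flip model."""
    p_wrong = 0.0   # assumption: the (fictitious) 0th prediction is correct
    total = 0.0
    for _ in range(T):
        # Stay wrong with prob. (1 - eps), or become wrong with prob. eps.
        p_wrong = p_wrong * (1 - eps) + (1 - p_wrong) * eps
        total += p_wrong
    return total


def expected_hamming_loss_closed_form(eps, T):
    """Closed form: T/2 - (1 - (1 - 2 eps)^(T+1)) / (4 eps) + 1/2."""
    return T / 2 - (1 - (1 - 2 * eps) ** (T + 1)) / (4 * eps) + 0.5
```

Under this model, a per-decision error rate of 0.1 on a sequence of length 100 already gives an expected Hamming loss of about 48, that is, close to $T/2$.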

Recently, [7] has described an algorithm termed stacked sequential learning
that attempts to remove this bias from MEMMs in a similar fashion
to Searn. The stacked algorithm learns a sequence of MEMMs, with the
model trained on the $(t+1)$st iteration based on outputs of the model from
the $t$th iteration. For sequence labeling problems, this is quite similar to
the behaviour of Searn when $\beta$ is set to 1. However, unlike Searn, the
stacked sequential learning framework is effectively limited to sequence
labeling problems. This limitation arises from the fact that it implicitly
assumes that the set of decisions one must make in the future is always going
to be the same, regardless of decisions in the past. In many applications, such
as entity detection and tracking [15], this is not true: the set of possible
choices (actions) available at time step $i$ is heavily dependent on past
choices. This makes stacked sequential learning inapplicable to these problems.

5.3 Perceptron-Style Algorithms

The structured perceptron is an extension of the standard perceptron [46] to 
structured prediction [8]. Assuming that the argmax problem is tractable, 
the structured perceptron constructs the weight vector in nearly an identical 
manner as for the binary case. While looping through the training data, 
whenever the predicted $\hat{y}_n$ for $x_n$ differs from $y_n$, we update the weights
according to Eq (15). 

$$w \leftarrow w + \Phi(x_n, y_n) - \Phi(x_n, \hat{y}_n) \qquad (15)$$

This weight update serves to bring the vector closer to the true out- 
put and further from the incorrect output. As in the standard perceptron, 
this often leads to a learned model that generalizes poorly. As before, one 
solution to this problem is weight averaging [18]. 
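
The update in Eq (15) can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the implementation of [8]: the feature map `features` (emission and transition counts) is hypothetical, the argmax is done by brute-force enumeration as a stand-in for a tractable decoder, and weight averaging is omitted for brevity.

```python
from collections import Counter
from itertools import product


def features(x, y):
    """Hypothetical feature map Phi(x, y): emission and transition counts."""
    phi = Counter()
    for t, (token, label) in enumerate(zip(x, y)):
        phi[("emit", token, label)] += 1
        if t > 0:
            phi[("trans", y[t - 1], label)] += 1
    return phi


def score(w, x, y):
    """Linear score w . Phi(x, y); missing weights count as zero."""
    return sum(w[f] * v for f, v in features(x, y).items())


def predict(w, x, labels):
    """Brute-force argmax over all label sequences (tiny inputs only)."""
    return max(product(labels, repeat=len(x)), key=lambda y: score(w, x, y))


def train(data, labels, epochs=5):
    """Structured perceptron without averaging."""
    w = Counter()
    for _ in range(epochs):
        for x, y in data:
            y_hat = predict(w, x, labels)
            if y_hat != tuple(y):
                # Eq (15): w <- w + Phi(x, y) - Phi(x, y_hat)
                w.update(features(x, y))
                w.subtract(features(x, y_hat))
    return w
```

Replacing `predict` with an approximate search procedure is exactly the step at which the guarantees discussed in Section 5.1 stop applying.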

The incremental perceptron [9] is a variant on the structured perceptron 
that deals with the issue that the argmax may not be analytically available. 
The idea of the incremental perceptron is to replace the arg max with a beam 
search algorithm. The key observation is that it is often possible to detect in 
the process of executing search whether it is possible for the resulting output 
to ever be correct. The incremental perceptron is essentially a search-based 
structured prediction technique, although it was initially motivated only 
as a method for speeding up convergence of the structured perceptron. In 
comparison to Searn, it is, however, much more limited. It cannot cope 
with arbitrary loss functions, and is limited to a beam-search application. 
Moreover, for search problems with a large number of internal decisions 
(such as entity detection and tracking [15]), aborting search at the first 
error is far from optimal. 

5.4 Global Prediction Algorithms 

Global prediction algorithms attempt to learn parameters that, essentially, 
rank correct (low loss) outputs higher than incorrect (high loss) alternatives. 

Conditional random fields are an extension of logistic regression (maximum
entropy models) to structured outputs [29]. Similar to the structured
perceptron, a conditional random field does not employ a loss function, but
rather optimizes a log-loss approximation to the 0/1 loss over the entire
output. Only when the features and structure are chosen properly can dynamic
programming techniques be used to compute the required partition function,
which typically limits the application of CRFs to linear chain models
under a Markov assumption.

The maximum margin Markov network (M$^3$N) formalism considers the
structured prediction problem as a quadratic programming problem [52,51],
following the formalism for the support vector machine for binary
classification. The M$^3$N formalism extends this to structured outputs under a
given loss function $\ell$ by requiring that the difference in score between the
true output $y$ and any incorrect output $\hat{y}$ is at least the loss
$\ell(x, y, \hat{y})$ (modulo slack variables). That is, the M$^3$N framework
scales the margin to be proportional to the loss. Under restrictions on the
output space and the features (essentially, linear chain models with Markov
features) it is possible to solve the corresponding quadratic program in
polynomial time.

In comparison to CRFs and M$^3$Ns, Searn is strictly more general.
Searn is limited neither to linear chains nor to Markov style features and 
can effectively and efficiently optimize structured prediction models under 
far weaker assumptions (see Section 6.2 for empirical evidence supporting 
this claim). 

6 Experimental Results 

In this section, we present experimental results on two different sorts of 
structured prediction problems. The first set of problems — the sequence la- 
beling problems — are comparatively simple and are included to demonstrate 
the application of Searn to easy tasks. They are also the most common ap- 
plication domain on which other structured prediction techniques are tested; 
this enables us to directly compare Searn with alternative algorithms on 
standardized data sets. The second application we describe is based on an 
automatic document summarization task, which is a significantly more com- 
plex domain than sequence labeling. This task enables us to test Searn on 
significantly more complex problems with loss functions that do not decom- 
pose over the structure. 

6.1 Sequence Labeling 

Sequence labeling is the task of assigning a label to each element in an input 
sequence. Sequence labeling is an attractive test bed for structured predic- 
tion algorithms because it is the simplest non-trivial structure. Modern 
state-of-the-art structured prediction techniques fare very well on sequence 
labeling problems. In this section, we present a range of results
investigating the performance of Searn on four separate sequence labeling tasks:
handwriting recognition, named entity recognition (in Spanish), syntactic
chunking and joint chunking and part-of-speech tagging.

Fig. 2 Eight example words from the handwriting recognition data set.

For pure sequence labeling tasks (i.e., when segmentation is not also 
done), the standard loss function is Hamming loss, which gives credit on a 
per label basis. For a true output $y$ of length $N$ and hypothesized output $\hat{y}$
(also of length $N$), Hamming loss is defined according to Eq (16).

$$\ell^{\text{Ham}}(y, \hat{y}) = \sum_{n=1}^{N} \mathbf{1}[y_n \neq \hat{y}_n] \qquad (16)$$

The most common loss function for joint segmentation and labeling problems
(like the named entity recognition and syntactic chunking problems) is
$F_1$ measure over chunks (see footnote 4). $F_1$ is the harmonic mean of
precision and recall over the (properly-labeled) chunk identification task,
given in Eq (17).

$$F_1(y, \hat{y}) = \frac{2 |y \cap \hat{y}|}{|y| + |\hat{y}|} \qquad (17)$$

As can be seen in Eq (17), one is penalized both for identifying too many
chunks (penalty in the denominator) and for identifying too few (penalty
in the numerator). The advantage of $F_1$ measure over Hamming loss is seen
most easily in problems where the majority of words are "not chunks" (for
instance, in gene name identification [40]): Hamming loss often prefers
a system that identifies no chunks to one that identifies some correctly
and others incorrectly. Using a weighted Hamming loss cannot completely
alleviate this problem, for essentially the same reasons that a weighted
zero-one loss cannot optimize $F_1$ measure in binary classification, though
one can often achieve an approximation [31,41].
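
The contrast between the two losses can be made concrete with a short sketch. The code below is illustrative only: it assumes chunks are encoded with BIO tags of the form `B-X`, `I-X` and `O`, treats a stray `I-X` as the start of a chunk, and returns 1.0 when neither output contains a chunk; all three are conventions chosen here, not taken from the paper.

```python
def hamming_loss(y, y_hat):
    """Number of positions at which the two label sequences differ."""
    assert len(y) == len(y_hat)
    return sum(a != b for a, b in zip(y, y_hat))


def chunks(tags):
    """Extract (start, end, type) chunks from BIO tags such as 'B-PER'."""
    found, start, kind = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel closes last chunk
        continues = tag.startswith("I-") and kind == tag[2:]
        if not continues:
            if kind is not None:
                found.add((start, i, kind))        # close the open chunk
            if tag == "O":
                start, kind = None, None
            else:                                  # 'B-X', or a stray 'I-X'
                start, kind = i, tag[2:]
    return found


def f1(y, y_hat):
    """Chunk-level F1: 2 |y ∩ y_hat| / (|y| + |y_hat|)."""
    gold, pred = chunks(y), chunks(y_hat)
    if not gold and not pred:
        return 1.0                                 # convention assumed here
    return 2 * len(gold & pred) / (len(gold) + len(pred))
```

For a ten-word sentence with two gene mentions, a system that outputs no chunks at all can have a lower Hamming loss than a system that finds one mention correctly and proposes a few spurious ones, even though the first system has an $F_1$ of zero.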



4 We note in passing that directly optimizing $F_1$ may not be the best approach,
from the perspective of integrating information in a pipeline [35]. However, since
$F_1$ is commonly used and does not decompose over the output sequence, we use
it for the purposes of demonstration.

El presidente de la [Junta de Extremadura]ORG , [Juan Carlos Rodriguez Ibarra]PER
, recibira en la sede de la [Presidencia del Gobierno]ORG extremeño a familiares
de varios de los condenados por el proceso " [Lasa-Zabala]MISC " , entre ellos
a [Lourdes Díez Urraca]PER , esposa del ex gobernador civil de [Guipuzcoa]LOC
[Julen Elgorriaga]PER ; y a [Antonio Rodriguez Galindo]PER , hermano del general
[Enrique Rodriguez Galindo]PER .

Fig. 3 Example labeled sentence from the Spanish Named Entity Recognition 
task. 

6.1.1 Handwriting Recognition The handwriting recognition task we con- 
sider was introduced by [25]. Later, [52] presented state-of-the-art results on 
this task using max-margin Markov networks. The task is an image recogni- 
tion task: the input is a sequence of pre-segmented hand-drawn letters and 
the output is the character sequence ("a"-"z") in these images. The data 
set we consider is identical to that considered by [52] and includes 6600 
sequences (words) collected from 150 subjects. The average word contains 8 
characters. The images are 8x16 pixels in size, and rasterized into a binary 
representation. Example image sequences are shown in Figure 2 (the first 
characters are removed because they are capitalized). 

For each possible output letter, there is a unique feature that counts 
how many times that letter appears in the output. Furthermore, for each 
pair of letters, there is an "edge" feature counting how many times this pair 
appears adjacent in the output. These edge features are the only "structural 
features" used for this task (i.e., features that span multiple output labels). 
Finally, for every output letter and for every pixel position, there is a feature 
that counts how many times that pixel position is "on" for the given output 
letter. 

In the experiments, we consider two variants of the data set. The first, 
"small," is the problem considered by [52]. In the small problem, ten fold 
cross-validation is performed over the data set; in each fold, roughly 600 
words are used as training data and the remaining 6000 are used as test data. 
In addition to this setting, we also consider the "large" reverse experiment: 
in each fold, 6000 words are used as training data and 600 are used as test 
data. 

6.1.2 Spanish Named Entity Recognition The named entity recognition 
(NER) task is concerned with spotting names of persons, places and
organizations in text. In NER we aim to spot only names, and neither
pronouns ("he") nor nominal references ("the President"). We use
the CoNLL 2002 data set, which consists of 8324 training sentences and
1517 test sentences; examples are shown in Figure 3. A 300-sentence subset
of the training data set was previously used by [54] for evaluating the
SVM$^{struct}$ framework in the context of sequence labeling. The small
training set was likely used for computational considerations. The best reported
results to date using the full data set are due to [2]. We report results on 
both the "small" and "large" data sets. 

[Great American]NP [said]VP [it]NP [increased]VP [its loan-loss reserves]NP [by]PP
[$ 93 million]NP [after]PP [reviewing]VP [its loan portfolio]NP , [raising]VP
[its total loan and real estate reserves]NP [to]PP [$ 217 million]NP .

Fig. 4 Example labeled sentence from the syntactic chunking task. 

The structural features used for this task are roughly the same as in the 
handwriting recognition case. For each label, each label pair and each label 
triple, a feature counts the number of times this element is observed in the 
output. Furthermore, the standard set of input features includes the words 
and simple functions of the words (case markings, prefix and suffix up to 
three characters) within a window of ±2 around the current position. These 
input features are paired with the current label. This feature set is fairly 
standard in the literature, though [2] report significantly improved results 
using a much larger set of features. In the results shown later in this section, 
all comparison algorithms use identical feature sets. 

6.1.3 Syntactic Chunking The final sequence labeling task we consider is 
syntactic chunking (for English), based on the CoNLL 2000 data set. This 
data set includes 8936 sentences of training data and 2012 sentences of test 
data. An example is shown in Figure 4. (Several authors have considered 
the noun-phrase chunking task instead of the full syntactic chunking task. 
It is important to notice the difference, though results on these two tasks 
are typically very similar, indicating that the majority of the difficulty is 
with noun phrases.) 

We use the same set of features across all models, separated into "base 
features" and "meta features." The base features apply to words individu- 
ally, while meta features apply to entire chunks. The standard base features 
used are: the chunk length, the word (original, lower cased, stemmed, and 
original-stem), the case pattern of the word, the first and last 1, 2 and 3 
characters, and the part of speech and its first character. We additionally 
consider membership features for lists of names, locations, abbreviations, 
stop words, etc. The meta features we use are, for any base feature b, b 
at position i (for any sub-position of the chunk), b before/after the chunk, 
the entire $b$-sequence in the chunk, and any 2- or 3-gram tuple of $b$s in the
chunk. We use a first order Markov assumption (chunk label only depends 
on the most recent previous label) and all features are placed on labels, 
not on transitions. In the results shown later in this section, some of the 
algorithms use a slightly different feature set. In particular, the CRF-based 
model uses similar, but not identical features; see [50] for details. 

6.1.4 Joint Chunking and Tagging In the preceding sections, we considered 
the single sequence labeling task: to each element in a sequence, a single 
label is assigned. In this section, we consider the joint sequence labeling 
task. In this task, each element in a sequence is labeled with multiple tags. 
A canonical example of this task is joint POS tagging and syntactic chunking
[49]. An example sentence jointly labeled for these two outputs is shown in
Figure 5 (under the BIO encoding).

Great/NNP/B-NP American/NNP/I-NP said/VBD/B-VP it/PRP/B-NP increased/VBD/B-VP
its/PRP$/B-NP loan-loss/NN/I-NP reserves/NNS/I-NP by/IN/B-PP $/$/B-NP
93/CD/I-NP million/CD/I-NP after/IN/B-PP reviewing/VBG/B-VP its/PRP$/B-NP
loan/NN/I-NP portfolio/NN/I-NP ././O

Fig. 5 Example sentence for the joint POS tagging and syntactic chunking task
(shown as word/POS/chunk).

For Searn, there is little difference between standard sequence labeling 
and joint sequence labeling. We use the same data set as for the standard 
syntactic chunking task (Section 6.1.3) and essentially the same features. 
In order to model the fact that the two streams of labels are not indepen- 
dent, we decompose the problem into two parallel tagging tasks. First, the 
first POS label is determined, then the first chunk label, then the second 
POS label, then the second chunk label, etc. The only difference between 
the features we use in this task and the vanilla chunking task has to do
with the structural features. The structural features we use include the obvious
Markov features on the individual sequences: counts of singleton, doubleton 
and tripleton POS and chunk tags. We also use "crossing sequence" fea- 
tures. In particular, we use counts of pairs of POS and chunk tags at the 
same time period as well as pairs of POS tags at time t and chunk tags at 
t — 1 and vice versa. 
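
The interleaved decision order described above (first POS label, first chunk label, second POS label, second chunk label, and so on) can be sketched as a simple merge of the two tag streams. The function names `interleave` and `split` are hypothetical helpers introduced here for illustration; they are not part of the Searn implementation.

```python
def interleave(pos_tags, chunk_tags):
    """Merge two tag streams into one decision sequence:
    POS_1, chunk_1, POS_2, chunk_2, ..."""
    assert len(pos_tags) == len(chunk_tags)
    return [tag for pair in zip(pos_tags, chunk_tags) for tag in pair]


def split(decisions):
    """Inverse of interleave: recover the POS and chunk streams."""
    return decisions[0::2], decisions[1::2]
```

With this layout, the chunk decision at position $t$ can condition on the POS decision at the same position, and the POS decision at position $t$ can condition on the chunk decision at $t-1$, which is what the "crossing sequence" features exploit.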

6.1.5 Search and Initial Policies The choice of "search" algorithm in Searn 
essentially boils down to the choice of output vector representation, since, 
as defined, Searn always operates in a left-to-right manner over the output 
vector. In this section, we describe vector representations for the output 
space and corresponding optimal policies for Searn. 

The most natural vector encoding of the sequence labeling problem is 
simply as itself. In this case, the search proceeds in a greedy left-to-right 
manner with one word being labeled per step. This search order admits 
some linguistic plausibility for many natural language problems. It is also 
attractive because (assuming unit-time classification) it scales as $O(NL)$,
where $N$ is the length of the input and $L$ is the number of labels,
independent of the number of features or the loss function. However, this vector
encoding is also highly biased, in the sense that it is perhaps not optimal for
some (perhaps unnatural) problems. Other orders are possible (such as
allowing any arbitrary position to be labeled at any time, effectively mimicking
belief propagation); see [12] for more experimental results under alternative
orderings. 

For joint segmentation and labeling tasks, such as named entity identi- 
fication and syntactic chunking, there are two natural encodings: word-at- 
a-time and chunk-at-a-time. In word-at-a-time, one essentially follows the 
"BIO encoding" and tags a single word in each search step. In chunk-at- 
a-time, one tags single chunks in each search step, which can consist of 
multiple words (after fixing a maximum phrase length). In our experiments,
we focus exclusively on chunk-at-a-time decoding, as it is more expressive
(feature-wise) and has been seen to perform better in other scenarios [48].

Under the chunk-at-a-time encoding, an input of length $N$ leads to a
vector of length $N$ over $M \times L + 1$ labels, where $M$ is the maximum
phrase length. The interpretation of the first $M \times L$ labels, for instance
$(m, l)$, is that the next phrase is of length $m$ and is a phrase of type $l$.
The final ("$+1$") label corresponds to a "complete" indicator. Any vector for
which the sum of the "$m$" components is not exactly $N$ attains maximum loss.

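The chunk-at-a-time encoding can be illustrated with a small sketch. Several details below are assumptions made for illustration and are not stated in the text: the vector is padded to length $N$ with the "complete" indicator, words outside any chunk would be encoded as length-one chunks of a dedicated type, and the names `encode`, `decode` and `COMPLETE` are hypothetical.

```python
COMPLETE = "complete"   # the extra ("+1") label


def encode(chunks, n):
    """chunks: list of (length, type) pairs covering the input left to right.
    Returns a length-n vector: one (m, l) action per chunk, then padding."""
    assert sum(m for m, _ in chunks) == n
    return list(chunks) + [COMPLETE] * (n - len(chunks))


def decode(vector, n):
    """Return the per-word chunk types, or None when the lengths do not sum
    to n (such vectors attain maximum loss under this encoding)."""
    spans = [a for a in vector if a != COMPLETE]
    if sum(m for m, _ in spans) != n:
        return None
    return [l for m, l in spans for _ in range(m)]
```

For the first four words of the sentence in Figure 4, `encode([(2, "NP"), (1, "VP"), (1, "NP")], 4)` gives three chunk actions followed by one "complete" label.
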
6.1.6 Initial Policies For the sequence labeling problem under Hamming
loss, the optimal policy is always to label the next word correctly. In the
left-to-right order, this is straightforward. For the segmentation problem,
word-at-a-time and chunk-at-a-time behave very similarly with respect to
the loss function and optimal policy. We discuss word-at-a-time because it is
notationally more convenient, but the difference is negligible. The optimal
policy can be computed by analyzing a few options in Eq (18), where $y_t$ is
the true label and $\hat{y}_t$ is the label chosen at position $t$.

$$\hat{y}_t = \begin{cases} \text{begin } X & y_t = \text{begin } X \\ \text{in } X & y_t = \text{in } X \text{ and } \hat{y}_{t-1} \in \{\text{begin } X, \text{in } X\} \\ \text{out} & \text{otherwise} \end{cases} \qquad (18)$$

It is easy to show that this policy is optimal (assuming noise-free training
data). There is, however, another equally optimal policy. For instance, if $y_t$
is "in X" but $\hat{y}_{t-1}$ is "in Y" (for $X \neq Y$), then it is equally
optimal to select $\hat{y}_t$ to be "out" or "in Y". In theory, when the optimal
policy does not care about a particular decision, one can randomize over the
selection. However, in practice, we always default to a particular choice to
reduce noise in the learning process.

For all of the policies described above, it is also straightforward to
compute the optimal approximation for estimating the expected cost of an
action. In the Hamming loss case, the loss is 0 if the choice is correct and 1
otherwise. The computation for $F_1$ loss is a bit more complicated: one needs
to compute an optimal intersection size for the future and add it to the past
"actual" size. This is also straightforward by analyzing the same cases as in
Eq (18).

6.1.7 Experimental Results and Discussion In this section, we compare
the performance of Searn to the performance of alternative structured
prediction techniques over the data sets described above. The results of this
evaluation are shown in Table 1. In this table, we compare raw classification
algorithms (perceptron, logistic regression and SVMs) to alternative
structured prediction algorithms (structured perceptron, CRFs, SVM$^{struct}$s and
M$^3$Ns) to Searn with three baseline classifiers (perceptron, logistic
regression and SVMs). For all SVM algorithms and for M$^3$Ns, we compare both
linear and quadratic kernels (cubic kernels were evaluated but did not lead
to improved performance over quadratic kernels).

ALGORITHM        Handwriting       NER               Chunk    C+T
                 Small    Large    Small    Large

CLASSIFICATION
Perceptron       65.56    70.05    91.11    94.37    83.12    87.88
Log Reg          68.65    72.10    93.62    96.09    85.40    90.39
SVM-Lin          75.75    82.42    93.74    97.31    86.09    93.94
SVM-Quad         82.63    82.52    85.49    85.49

STRUCTURED
Str. Perc.       69.74    74.12    93.18    95.32    92.44    93.12
CRF                                94.94             94.77    96.48
SVM^struct                         94.90
M^3N-Lin         81.00
M^3N-Quad        87.00

SEARN
Perceptron       70.17    76.88    95.01    97.67    94.36    96.81
Log Reg          73.81    79.28    95.90    98.17    94.47    96.95
SVM-Lin          82.12    90.58    95.91    98.11    94.44    96.98
SVM-Quad         87.55    90.91    89.31    90.01

Table 1 Empirical comparison of performance of alternative structured predic- 
tion algorithms against Searn on sequence labeling tasks. (Top) Comparison for 
whole-sequence 0/1 loss; (Bottom) Comparison for individual losses: Hamming 
for handwriting and Chunking+Tagging and F for NER and Chunking. Searn is
always optimized for the appropriate loss. 



For all Searn-based models, we use the following settings of the
tunable parameters (see [12] for a comparison of different settings). We use 
the optimal approximation for the computation of the per-action costs. We 
use a left-to-right search order with a beam of size 10. For the chunking 
tasks, we use chunk-at-a-time search. We use weighted all pairs and costing 
to reduce from cost-sensitive classification to binary classification. 

Note that some entries in Table 1 are missing. The vast majority of these 
entries are missing because the algorithm considered could not reasonably
scale to the data set under consideration. These are indicated with a "~" 
symbol. Other entries are not available simply because the results we report 
are copied from other publications and these publications did not report all 
relevant scores. These are indicated with a "— " symbol. 

We observe several patterns in the results from Table 1. The first is that
structured techniques consistently outperform their classification
counterparts (e.g., CRFs outperform logistic regression). The single exception
is on the small handwriting task: the quadratic SVM outperforms the quadratic
M$^3$N (see footnote 5). For all classifiers, adding Searn consistently
improves performance.

An obvious pattern worth noticing is that moving from the small data
set to the large data set results in improved performance, regardless of
learning algorithm. However, equally interesting is that simple classification
techniques when applied to large data sets outperform complicated
learning techniques applied to small data sets. Although this comparison is
not completely fair (both algorithms should get access to the same data),
if the algorithm (like the SVM$^{struct}$ or the M$^3$N) cannot scale to the large
data set, then something is missing. For instance, a vanilla SVM on the
large handwriting data set outperforms the M$^3$N on the small set. Similarly,
a vanilla logistic regression classifier trained on the large NER data
set outperforms the SVM$^{struct}$ and the CRF on the small data sets.

5 However, it should be noted that a different implementation technique was
used in this comparison. The M$^3$N is based on an SMO algorithm, while the
quadratic SVM is libsvm [6].

On the same data set, Searn can perform comparably to or better than
competing structured prediction techniques. On the small handwriting task,
the two best performing systems are M$^3$Ns with quadratic kernels (87.0%
accuracy) and Searn with quadratic SVMs (87.6% accuracy). On the NER
task, Searn with a perceptron classifier performs comparably to SVM$^{struct}$
and CRFs (at around 95.9% accuracy). On the Chunking+Tagging task, all
varieties of Searn perform comparably to the CRF. In fact, the only task
on which Searn does not outperform the competing techniques is the
raw chunking task, for which the CRF obtains an F-score of 94.77 compared
to 94.47 for Searn, using a significantly different feature set.

The final result from Table 1 worth noticing is that, with the excep- 
tion of the handwriting recognition task, Searn using logistic regression 
as a base learner performs at the top of the pack. The SVM-based Searn 
models typically perform slightly better, but not significantly. In fact, the 
raw averaged perceptron with Searn performs almost as well as the logis- 
tic regression. This is a nice result because the SVM-based models tend to 
be expensive to train, especially in comparison to the perceptron. The fact 
that this pattern does not hold for the handwriting task is likely due to 
the fact that the data for this task is quite unlike the data for the other 
tasks. For the handwriting task, there are a comparatively small number of 
features which are individually much less predictive of the class. It is only 
in combination that good classifiers can be learned. 

While these results are useful, they should be taken with a grain of salt. 
Sequence labeling is a very easy problem. The structure is simple and the 
most common loss functions decompose over the structure. The compara- 
tively good performance of raw classifiers suggests that the importance of 
structure is minor. In fact, some results suggest that one need not actually 
consider the structure at all for some such problems [43,45]. 

6.2 Automatic Document Summarization 

Multidocument summarization is the task of creating a summary out of a
collection of documents on a focused topic. In query-focused summarization,
this topic is given explicitly in the form of a user's query. The dominant
approach to the multidocument summarization problem is sentence
extraction: a summary is created by greedily extracting sentences from the
document collection until a pre-defined word limit is reached. [53] and [33]
describe representative examples. Recent work in sentence compression [26,38]
and document compression [13] attempts to take small steps beyond
sentence extraction. Compression models can be seen as techniques for
extracting sentences then dropping extraneous information. They are more
powerful than simple sentence extraction systems, while remaining trainable
and tractable. Unfortunately, their training hinges on the existence of
(sentence, compression) pairs, where the compression is obtained from
the sentence by only dropping words and phrases (the work of [56] is an
exception). Obtaining such data is quite challenging.

"ate ."    "the man ate ."    "the man ate a sandwich ."

Fig. 6 An example of the creation of a summary under the vine-growth model.

The exact model we use for the document summarization task is a novel 
"vine-growth" model, described in more detail in [12]. The vine-growth 
method uses syntactic parses of the sentence in the form of dependency 
structures. In the vine-growth model, if a word w is to be included in the 
summary, then all of its ancestors, up to the tree root, are included as well. 

6.2.1 Search Space and Actions The search algorithm we employ for im- 
plementing the vine-growth model is based on incrementally growing sum- 
maries. In essence, beginning with an empty summary, the algorithm incre- 
mentally adds words to the summary, either by beginning a new sentence 
or growing existing sentences. At any step in search, the root of a new sen- 
tence may be added, as may any direct child of a previously added node. To 
see more clearly how the vine-growth model functions, consider Figure 6. 
This figure shows a four step process for creating the summary "the man 
ate a sandwich ." from the original document sentence "the man ate a big 
sandwich with pickles ." 

When there is more than one sentence in the source documents, the 
search proceeds asynchronously across all sentences. When the sentences 
are laid out adjacently, the end summary is obtained by taking all the green 
summary nodes once a pre-defined word limit has been reached. This final 
summary is a collection of subtrees grown off a sequence of underlying trees: 
hence the name "vine-growth." 
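
The action set described above can be sketched in a few lines. This is an illustrative sketch only, not the implementation used in the experiments: the parent-list encoding of a dependency tree, the function names `legal_actions` and `render`, and the particular parse of the example sentence are all assumptions made for the example.

```python
from typing import List, Optional, Set, Tuple

# A sentence is (words, parents): parents[i] is the index of word i's
# parent in the dependency tree, or None if word i is the root.
Sentence = Tuple[List[str], List[Optional[int]]]
Node = Tuple[int, int]  # (sentence index, word index)


def legal_actions(sentences: List[Sentence], summary: Set[Node]) -> List[Node]:
    """Words that may be added next: the root of any sentence, or a
    direct child of a node that is already in the summary."""
    actions = []
    for s, (_, parents) in enumerate(sentences):
        for w, parent in enumerate(parents):
            if (s, w) in summary:
                continue
            if parent is None or (s, parent) in summary:
                actions.append((s, w))
    return actions


def render(sentences: List[Sentence], summary: Set[Node]) -> str:
    """Read the summary words off in their original document order."""
    return " ".join(sentences[s][0][w] for s, w in sorted(summary))


words = ["the", "man", "ate", "a", "big", "sandwich", "with", "pickles", "."]
# Hypothetical dependency parse: "ate" (index 2) is the root.
parents = [1, 2, None, 5, 5, 2, 5, 6, 2]
doc = [(words, parents)]

summary: Set[Node] = set()
for node in [(0, 2), (0, 8), (0, 1), (0, 0), (0, 5), (0, 3)]:
    assert node in legal_actions(doc, summary)  # every step is a legal growth
    summary.add(node)
print(render(doc, summary))  # the man ate a sandwich .
```

With several source sentences, `legal_actions` simply ranges over all of them, which corresponds to the asynchronous growth across sentences described above.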

6.2.2 Data and Evaluation Criteria For data, we use the DUC 2005 data 
set [11]. This consists of 50 document collections of 25 documents each; 
each document collection includes a human-written query. Each document 
collection additionally has five human-written "reference" summaries (250 
words long, each) that serve as the gold standard. In the official DUC eval- 
uations, all 50 collections are "test data." However, since the DUC 2005 
task is significantly different from previous DUC tasks, there is no good 
source of training data. Therefore, we report results based on 10-fold cross 
validation. We train on 45 collections and test on the remaining 5. 
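
For concreteness, the split sizes can be checked with a short sketch. The contiguous assignment of collections to folds below is an assumption; the text does not say how the 50 collections were partitioned.

```python
def folds(n_items=50, n_folds=10):
    """Yield (train, test) index lists; fold k holds out one contiguous block.
    The contiguous assignment is an assumption made for illustration."""
    ids = list(range(n_items))
    size = n_items // n_folds
    for k in range(n_folds):
        test = ids[k * size:(k + 1) * size]
        train = ids[:k * size] + ids[(k + 1) * size:]
        yield train, test


splits = list(folds())
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 45 5
```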

Automatic evaluation is a notoriously difficult problem for document 
summarization. The current popular choice for metric is Rouge [34], which 
(roughly speaking) computes n-gram overlap between a system summary 
and a set of human written summaries. In various experiments, Rouge has 
been seen to correspond with human judgment of summary quality. In the 
experiments described in this paper, we use the "Rouge 2" metric, which 
uses evenly weighted bigram scores. 
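
To make the "roughly speaking" description concrete, the following sketch computes a clipped bigram recall against a set of reference summaries. It is a simplification written for illustration and is not the official Rouge scorer, whose additional options are not reproduced here; the function names and whitespace tokenization are assumptions.

```python
from collections import Counter


def bigrams(tokens):
    """Multiset of adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))


def rouge2_recall(system, references):
    """Fraction of reference bigrams (pooled over all references) that
    also occur in the system summary, with counts clipped."""
    sys_counts = bigrams(system.split())
    matched = total = 0
    for ref in references:
        ref_counts = bigrams(ref.split())
        matched += sum(min(c, sys_counts[bg]) for bg, c in ref_counts.items())
        total += sum(ref_counts.values())
    return matched / total if total else 0.0


score = rouge2_recall("the man ate a sandwich .",
                      ["the man ate a big sandwich ."])
print(round(score, 3))  # 4 of the 6 reference bigrams are matched
```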

6.2.3 Initial Policy Computing the best label completion under the Rouge 
metric for the vine-growth model is intractable. The intractability stems 
from the model constraint that a word can only be added to a summary 
after its parent is added. We therefore use an approximate, search-based 
policy (see Section 3.4.2). In order to approximate the cost of a given par- 
tial summary, we search for the best possible completion. That is, if our goal 
is a 100 word summary and we have already created a 50 word summary, 
then we execute beam search (beam size 20) for the remaining 50 words 
that maximize the Rouge score. 
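
The completion search can be sketched as a generic beam search over summary states. This is a toy illustration under stated assumptions, not the system used in the experiments: the state encoding, the helper names `beam_complete`, `actions` and `score`, the example parse, and the use of a plain bigram-overlap count as a stand-in for the Rouge score are all hypothetical.

```python
from typing import Callable, FrozenSet, Iterable, Tuple

State = FrozenSet[int]  # indices of the words already in the summary


def beam_complete(start: State, budget: int,
                  actions: Callable[[State], Iterable[int]],
                  score: Callable[[State], float],
                  beam_size: int = 20) -> Tuple[State, float]:
    """Add up to `budget` words to `start`, keeping only the `beam_size`
    highest-scoring partial summaries after each step."""
    beam = [start]
    for _ in range(budget):
        candidates = {state | {a} for state in beam for a in actions(state)}
        if not candidates:
            break
        beam = sorted(candidates, key=score, reverse=True)[:beam_size]
    best = max(beam, key=score)
    return best, score(best)


# Toy instance: one sentence with a hypothetical dependency parse.
WORDS = ["the", "man", "ate", "a", "big", "sandwich", "with", "pickles", "."]
PARENTS = [1, 2, None, 5, 5, 2, 5, 6, 2]
REFERENCE = "the man ate a sandwich .".split()
REF_BIGRAMS = set(zip(REFERENCE, REFERENCE[1:]))


def actions(state):
    """A word may be added only after its parent (vine-growth constraint)."""
    return [i for i, p in enumerate(PARENTS)
            if i not in state and (p is None or p in state)]


def score(state):
    """Stand-in for Rouge: number of summary bigrams found in the reference."""
    tokens = [WORDS[i] for i in sorted(state)]
    return sum(bg in REF_BIGRAMS for bg in zip(tokens, tokens[1:]))


partial = frozenset({0, 1, 2})  # "the man ate"
best, value = beam_complete(partial, 3, actions, score)
print(" ".join(WORDS[i] for i in sorted(best)), value)
```

The score of the best completion found serves as the approximate value of the partial summary from which the search was started.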

6.2.4 Feature Functions Features in the vine-growth model may consider 
any aspect of the currently generated summary, and any part of the input 
document set. These features include simple lexical features: word identity, 
stem and part of speech of the word under consideration, the syntactic 
relation with its parent, the position and length of the sentence it appears 
in, whether it appears in quotes, the length of the document it appears 
in, and the number of pronouns and attribution verbs in the subtree rooted at 
the word. The features also include language model probabilities for: the 
word, sentence and subtree under language models derived from the query, 
a BayeSum representation of the query, and the existing partial summary. 

6.2.5 Experimental Results Experimental results are shown in Table 2. We 
report Rouge scores for summaries of length 100 and length 250. We compare 
the following systems. First, oracle systems that perform the summarization 
task with knowledge of the true output, attempting to maximize the Rouge 
score. We present results for an oracle sentence extraction system (Extr) 
and an oracle vine-growth system (Vine). Second, we present the results of 
the SEARN-based systems, again for both sentence extraction (Extr) and 
vine-growth (Vine). Both of these are trained with respect to the oracle 
system. 

          ORACLE          SEARN           BAYESUM
          Vine    Extr    Vine    Extr    D05     D03     Base    Best
100 w     .0729   .0362   .0415   .0345   .0340   .0316   .0181   -
250 w     .1351   .0809   .0824   .0767   .0762   .0698   .0403   .0725

Table 2 Summarization results; values are Rouge 2 scores (higher is better). 

(Note that it is impossible to compare against competing structured 
prediction techniques. This summarization problem, even in its simplified 
form, is far too complex to be amenable to other methods.) For comparison, 
we present results from the BayeSum system [14,16], which achieved the 
highest score according to human evaluations of responsiveness in DUC 05. 
This system, as submitted to DUC 05, was trained on DUC 2003 data; the 
results for this configuration are shown in the "D03" column. For the sake 
of fair comparison, we also present the results of this system, trained in 
the same cross-validation approach as the SEARN-based systems (column 
"D05"). Finally, we present the results for the baseline system and for the 
best DUC 2005 system (according to the Rouge 2 metric). 

As we can see from Table 2 at the 100 word level, sentence extraction 
is a nearly solved problem for this domain and this evaluation metric. That 
is, the oracle sentence extraction system yields a Rouge score of 0.0362, 
compared to the score achieved by the Searn system of 0.0345. This differ- 
ence is on the border of statistical significance at the 95% level. The next 
noticeable item in the results is that, although the SEARN-based extraction 
system comes quite close to the theoretical optimum, the oracle results for the 
vine-growth method are significantly higher. Not surprisingly, under Searn, 
the summaries produced by the vine-growth technique are uniformly better 
than those produced by raw extraction. The last aspect of the results 
to notice is how the SEARN-based models compare to the best DUC 2005 
system, which achieved a Rouge score of 0.0725. The SEARN-based systems 
uniformly dominate this result, but this comparison is not fair due to the 
training data. We can approximate the expected improvement for having 
the new training data by comparing the BayeSum system when trained on 
the DUC 2005 and DUC 2003 data: the improvement is 0.0064 absolute. 
When this result is added to the best DUC 2005 system, its score rises to 
0.0789, which is better than the SEARN-based extraction system but not as 
good as the vine-growth system. It should be noted that the best DUC 2005 
system was a purely extractive system [59] . 
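
The adjustment in the preceding paragraph is simple arithmetic on the 250-word row of Table 2, and can be checked directly. The variable names below are ours; the numbers are taken from the table.

```python
# 250-word Rouge 2 scores from Table 2.
bayesum_d05, bayesum_d03 = 0.0762, 0.0698
best_duc05 = 0.0725
searn_vine, searn_extr = 0.0824, 0.0767

# Estimated gain from training on DUC 2005 rather than DUC 2003 data.
gain = round(bayesum_d05 - bayesum_d03, 4)
# Best DUC 2005 system, credited with the same gain.
adjusted = round(best_duc05 + gain, 4)

print(gain, adjusted)                       # 0.0064 0.0789
print(searn_extr < adjusted < searn_vine)   # True
```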

7 Discussion and Conclusions 

In this paper, we have: 

— Presented an algorithm, Searn, for solving complex structured predic- 
tion problems with minimal assumptions on the structure of the output 
and loss function. 

— Compared the performance of Searn against standard structured pre- 
diction algorithms on standard sequence labeling tasks, showing that it 
is competitive with existing techniques. 

— Described a novel approach to summarization — the vine-growth method — 
and applied Searn to the underlying learning problem, yielding state- 
of-the-art performance on standardized summarization data sets. 

There are many lenses through which one can view the Searn algorithm. 

From an applied perspective, Searn is an easy technique for training 
models for which complex search algorithms must be used. For instance, 
when using multiclass logistic regression as a base classifier for Hamming 
loss, the first iteration of Searn is identical to training a maximum en- 
tropy Markov model. The subsequent iterations of Searn can be seen as 
attempting to get around the fact that MEMMs are trained assuming all 
previous decisions are made correctly. This assumption is false, of course, in 
practice. Similar recent algorithms such as decision-tree-based parsing [55] 
and perceptron-based machine translation [32] can also be seen as running 
a (slightly modified) first iteration of Searn. 

Searn contrasts with more typical algorithms such as CRFs and M3Ns 
in how information is shared across the output. Standard algorithms 
use exact (typically Viterbi) search at test time to share full information across 
the entire output, "trading off" one decision for another. Searn takes an 
alternative approach: it attempts to share information at training time. In 
particular, by training the classifier using a loss based on both past ex- 
perience and future expectations, the training attempts to integrate this 
information during learning. This is not dissimilar to the "alternative 
objective" proposed by [24] for CRFs. One approach is not necessarily better 
than the other; they are simply different ways to accomplish the same goal. 

One potential limitation to Searn is that when one trains a new classifier 
on the output of a previous iteration's classifier, it is usually the case 
that the previous iteration's classifier performs better on the training 
data than on the test data. This means that, although training via Searn 
is likely preferable to training against only an initial policy, it can still 
be overly optimistic. Based on the experimental evidence, it appears that 
this has yet to be a serious concern, but it remains worrisome. There are 
two easy ways to combat this problem. The first is simply to attempt to 
ensure that the learned classifiers do not overfit at all. In practice, however, 
this can be difficult. Another approach with a high computational cost is 
cross-validation. Instead of training one classifier in each Searn step, one 
could train ten, each holding out a different 10% of the data. When asked 
to run the "current" classifier on an example, the classifier not trained on 
the example is used. This does not completely remove the possibility of 
overfitting, but significantly lessens its likelihood. 
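
The cross-validation variant can be sketched as follows. This is a sketch under assumptions rather than a tested part of Searn: the helper names, the assignment of example i to fold i mod 10, and the `Memorizer` base learner are hypothetical. The memorizer is deliberately an extreme overfitter, so that the effect of holding examples out is visible.

```python
class Memorizer:
    """Stand-in base learner (hypothetical): memorizes its training set and
    answers "unseen" for anything else, i.e. it overfits completely."""

    def fit(self, X, y):
        self.table = dict(zip(X, y))
        return self

    def predict(self, x):
        return self.table.get(x, "unseen")


def train_held_out(X, y, n_folds=10, make_clf=Memorizer):
    """Train n_folds classifiers; classifier k never sees the examples
    whose index i satisfies i % n_folds == k."""
    models = []
    for k in range(n_folds):
        keep = [i for i in range(len(X)) if i % n_folds != k]
        models.append(make_clf().fit([X[i] for i in keep],
                                     [y[i] for i in keep]))
    return models


def predict_training_example(models, i, x):
    """Run the "current" classifier on training example i by using the
    one classifier that was not trained on it."""
    return models[i % len(models)].predict(x)


X = list(range(30))
y = ["even" if x % 2 == 0 else "odd" for x in X]
models = train_held_out(X, y)

print(models[1].predict(X[0]))                    # even (example 0 was memorized)
print(predict_training_example(models, 0, X[0]))  # unseen (example 0 was held out)
```

The prediction for each training example is thus made by a classifier that was trained without it, at the price of training ten classifiers per Searn step instead of one.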

A second limitation, pointed out by [60], is that there is a slight dis- 
parity between what Searn does at a theoretical level and how Searn 
functions in practice. In particular, Searn does not actually start with the 
optimal policy. Even when we can compute the initial policy exactly, the 
"true outputs" on which this initial policy is based are potentially noisy. 
This means that while π is optimal for the noisy data, it is not optimal for 
the true data distribution. In fact, it is possible to construct noisy distri- 
butions where Searn performs poorly. 6 Finding other initial policies which 
are closer to optimal in these situations is an open problem. 

Searn obeys a desirable theoretical property: given a good classifica- 
tion algorithm, one is guaranteed a good structured prediction algorithm. 
Importantly, this result is independent of the size of the search space or 
the tractability of the search method. This shows that local learning — when 
done properly — can lead to good global performance. From the perspec- 
tive of applied machine learning, Searn serves as an interpreter through 
which engineers can easily make use of state-of-the-art machine learning 
techniques. 

In the context of structured prediction algorithms, Searn lies somewhere 
between global learning algorithms, such as M3Ns and CRFs, and 
local learning algorithms, such as those described in [43]. The key difference 
between Searn and global algorithms is in how uncertainty is handled. In 
global algorithms, the search algorithm is used at test time to propagate 
uncertainty across the structure. In Searn, the prediction costs are used 
during training time to propagate uncertainty across the structure. Both 
contrast with local learning, in which no uncertainty is propagated. 

From a wider machine learning perspective, Searn makes more apparent 
the connection between reinforcement learning and structured prediction. In 
particular, structured prediction can be viewed as a reinforcement learning 
problem in a degenerate world in which all observations are available at 
the initial time step. However, there are clearly alternative middle-grounds 
between pure structured prediction and full-blown reinforcement learning 
(and natural applications — such as planning — in this realm) for which this 
connection might serve to be useful. 

Despite these successes, there is much future work that is possible. One 
significant open question on the theoretical side is that of sample complex- 
ity: "How many examples do we need in order to achieve learning under 
additional assumptions?" Related problems of semi-supervised and active 
learning in the Searn framework are also interesting and likely to produce 
powerful extensions. Another vein of research is in applying Searn to domains 
other than language. Structured prediction problems arise in a large 
variety of settings (vision, biology, system design, compilers, etc.). For each 
of these domains, different sorts of search algorithms and different sorts 
of features are necessary. Although Searn has been discussed largely as a 
method for solving structured prediction problems, it is, more generally, a 
method for integrating search and learning. This leads to potential applica- 
tions of Searn that fall strictly outside the scope of structured prediction. 

6 One can construct such a noisy distribution as follows. Suppose there is fun- 
damental noise and a "safe" option which results in small loss. Suppose this safe 
option is always more than a one step deviation from the highly noisy "optimal" 
sequence. Searn can be confused by this divergence. 